Building a Corpus for Japanese Wikification with Fine-Grained Entity Classes

نویسندگان

  • Davaajav Jargalsaikhan
  • Naoaki Okazaki
  • Koji Matsuda
  • Kentaro Inui
چکیده

In this research, we build a Wikification corpus for advancing Japanese Entity Linking. This corpus consists of 340 Japanese newspaper articles with 25,675 entity mentions. All entity mentions are labeled by a fine-grained semantic classes (200 classes), and 19,121 mentions were successfully linked to Japanese Wikipedia articles. Even with the fine-grained semantic classes, we found it hard to define the target of entity linking annotations and to utilize the fine-grained semantic classes to improve the accuracy of entity linking.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Fine-grained Arabic named entity recognition

Named Entity Recognition (NER) is a Natural Language Processing (NLP) task, which aims to extract useful information from unstructured textual data by detecting and classifying Named Entity (NE) phrases into predefined semantic classes. This thesis addresses the problem of fine-grained NER for Arabic, which poses unique linguistic challenges to NER; such as the absence of capitalisation and sho...

متن کامل

Automatically Developing a Fine-grained Arabic Named Entity Corpus and Gazetteer by utilizing Wikipedia

This paper presents a methodology to exploit the potential of Arabic Wikipedia to assist in the automatic development of a large Fine-grained Named Entity (NE) corpus and gazetteer. The corner stone of this approach is efficient classification of Wikipedia articles to target NE classes. The resources developed were thoroughly evaluated to ensure reliability and a high quality. Results show the ...

متن کامل

Low-Complexity Heuristics for Deriving Fine-Grained Classes of Named Entities from Web Textual Data

We introduce a low-complexity method for acquiring fine-grained classes of named entities from the Web. The method exploits the large amounts of textual data available on the Web, while avoiding the use of any expensive text processing techniques or tools. The quality of the extracted classes is encouraging with respect to both the precision of the sets of named entities acquired within various...

متن کامل

Wikification for Scriptio Continua

The fact that Japanese employs scriptio continua, or a writing system without spaces, complicates the first step of an NLP pipeline. Word segmentation is widely used in Japanese language processing, and lexical knowledge is crucial for reliable identification of words in text. Although external lexical resources like Wikipedia are potentially useful, segmentation mismatch prevents them from bei...

متن کامل

Fine-grained Dutch named entity recognition

This paper describes the creation of a fine-grained named entity annotation scheme and corpus for Dutch, and experiments on automatic main type and subtype named entity recognition. We give an overview of existing named entity annotation schemes, and motivate our own, which describes six main types (persons, organizations, locations, products, events and miscellaneous named entities) and finer-...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016